One prominent example of unsupervised learning is the analysis of text-as-data. We see this when using data science methods to describe and understand the “sentiments” being expressed on Twitter, the content and bias in reporting, and even questions related to prediction: e.g., who wrote a piece of text (and was it plagiarized)?
In what follows we are going to expand our focus on unsupervised learning methods by applying kmeans to text data, to give you a sense of how the basic algorithm has applicability beyond what you might typically think of as data. It also helps illustrate how much of what we do as data scientists depends on the choices that we make in terms of measurement.
The text we are going to focus on is “The Federalist” papers – a set of 85 essays promoting the ratification of the United States Constitution to the people of New York State, written by Alexander Hamilton, James Madison, and John Jay between 1787 and 1788.
The underlying story was partially recounted in a much more interesting depiction on Broadway.
There are two basic questions that we can use data science methods to address.
First, what are the Federalist papers about? What did the authors think that they needed to say to convince the delegates to the New York State Constitutional Convention? How much attention did they devote to which issues and concerns? This is an application of unsupervised learning because we are going to use the algorithm to learn about the patterns in the data rather than tell the algorithm which patterns to predict.
Second, there is a debate about the authorship of 11 Federalist papers. The authors of the Federalist Papers wrote under a pseudonym to protect their identities – and also to give themselves cover in case they changed their minds. Perhaps Caesar was not the best choice for someone writing about the benefits of democracy? Just saying….
But Hamilton’s duel and his desire to claim ownership over his legacy meant that he “leaked” a copy of the papers with a list of who he claimed had written each paper. Madison disagreed with Hamilton’s claim on 8 of the 85 papers, and it was impossible for the two to reconcile because of Hamilton’s death. But can we use the authorship of known and uncontested Federalist Papers to predict the authorship of the contested papers? This pivots from unsupervised learning to supervised learning: how can we use known information to “supervise” the creation of a prediction model and make a prediction about an unknown variable? To do so we are going to use a very basic supervised learning algorithm – linear regression.
A collection of text data is called a “corpus”. Let’s load the complete set of Federalist Papers, which is saved as a tidy-format corpus.
library("SnowballC")
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("tidytext")
load(file="FederalistPaperCorpusTidy.Rda")
glimpse(corpus.tidy)
## Rows: 85
## Columns: 3
## $ author <chr> "hamilton", "jay", "jay", "jay", "jay", "hamilton", "hamilton…
## $ text <chr> "after an unequivocal experience of the inefficiency of the s…
## $ document <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
So there are 85 observations – one for each Federalist paper (indexed by document) – and each document has an associated author as well as a text variable containing the entire text of the document. So if we look at the first observation of the text variable we can see the entirety of Federalist Paper 1 (with line breaks denoted by \n). Note that we have preprocessed the document to remove all numbers, punctuation, and capitalization so that each entry contains only the words in the document.
corpus.tidy$text[1]
## fp01.txt
## "after an unequivocal experience of the inefficiency of the subsisting \n federal government you are called upon to deliberate on a new constitution \n for the united states of america the subject speaks its own importance \n comprehending in its consequences nothing less than the existence of the \n union the safety and welfare of the parts of which it is composed the \n fate of an empire in many respects the most interesting in the world \n it has been frequently remarked that it seems to have been reserved to \n the people of this country by their conduct and example to decide the \n important question whether societies of men are really capable or not \n of establishing good government from reflection and choice or whether \n they are forever destined to depend for their political constitutions \n on accident and force if there be any truth in the remark the crisis \n at which we are arrived may with propriety be regarded as the era in which \n that decision is to be made and a wrong election of the part we shall \n act may in this view deserve to be considered as the general misfortune \n of mankind this idea will add the inducements of philanthropy to those of patriotism \n to heighten the solicitude which all considerate and good men must feel \n for the event happy will it be if our choice should be directed by a \n judicious estimate of our true interests unperplexed and unbiased by \n considerations not connected with the public good but this is a thing \n more ardently to be wished than seriously to be expected the plan offered \n to our deliberations affects too many particular interests innovates \n upon too many local institutions not to involve in its discussion a variety \n of objects foreign to its merits and of views passions and prejudices \n little favorable to the discovery of truth among the most formidable of the obstacles which the new constitution \n will have to encounter may readily be distinguished the obvious interest \n of a certain 
class of men in every state to resist all changes which may \n hazard a diminution of the power emolument and consequence of the offices \n they hold under the state establishments and the perverted ambition of \n another class of men who will either hope to aggrandize themselves by \n the confusions of their country or will flatter themselves with fairer \n prospects of elevation from the subdivision of the empire into several \n partial confederacies than from its union under one government it is not however my design to dwell upon observations of this nature \n i am well aware that it would be disingenuous to resolve indiscriminately \n the opposition of any set of men merely because their situations might \n subject them to suspicion into interested or ambitious views candor \n will oblige us to admit that even such men may be actuated by upright \n intentions and it cannot be doubted that much of the opposition which \n has made its appearance or may hereafter make its appearance will spring \n from sources blameless at least if not respectablethe honest errors \n of minds led astray by preconceived jealousies and fears so numerous \n indeed and so powerful are the causes which serve to give a false bias \n to the judgment that we upon many occasions see wise and good men on \n the wrong as well as on the right side of questions of the first magnitude \n to society this circumstance if duly attended to would furnish a lesson \n of moderation to those who are ever so much persuaded of their being in \n the right in any controversy and a further reason for caution in this \n respect might be drawn from the reflection that we are not always sure \n that those who advocate the truth are influenced by purer principles than \n their antagonists ambition avarice personal animosity party opposition \n and many other motives not more laudable than these are apt to operate \n as well upon those who support as those who oppose the right side of a \n question were there 
not even these inducements to moderation nothing \n could be more illjudged than that intolerant spirit which has at all \n times characterized political parties for in politics as in religion \n it is equally absurd to aim at making proselytes by fire and sword heresies \n in either can rarely be cured by persecution and yet however just these sentiments will be allowed to be we have \n already sufficient indications that it will happen in this as in all former \n cases of great national discussion a torrent of angry and malignant passions \n will be let loose to judge from the conduct of the opposite parties \n we shall be led to conclude that they will mutually hope to evince the \n justness of their opinions and to increase the number of their converts \n by the loudness of their declamations and the bitterness of their invectives \n an enlightened zeal for the energy and efficiency of government will be \n stigmatized as the offspring of a temper fond of despotic power and hostile \n to the principles of liberty an overscrupulous jealousy of danger to \n the rights of the people which is more commonly the fault of the head \n than of the heart will be represented as mere pretense and artifice \n the stale bait for popularity at the expense of the public good it will \n be forgotten on the one hand that jealousy is the usual concomitant \n of love and that the noble enthusiasm of liberty is apt to be infected \n with a spirit of narrow and illiberal distrust on the other hand it \n will be equally forgotten that the vigor of government is essential to \n the security of liberty that in the contemplation of a sound and wellinformed \n judgment their interest can never be separated and that a dangerous \n ambition more often lurks behind the specious mask of zeal for the rights \n of the people than under the forbidden appearance of zeal for the firmness \n and efficiency of government history will teach us that the former has \n been found a much more certain 
road to the introduction of despotism than \n the latter and that of those men who have overturned the liberties of \n republics the greatest number have begun their career by paying an obsequious \n court to the people commencing demagogues and ending tyrants in the course of the preceding observations i have had an eye my fellowcitizens \n to putting you upon your guard against all attempts from whatever quarter \n to influence your decision in a matter of the utmost moment to your welfare \n by any impressions other than those which may result from the evidence \n of truth you will no doubt at the same time have collected from the \n general scope of them that they proceed from a source not unfriendly \n to the new constitution yes my countrymen i own to you that after \n having given it an attentive consideration i am clearly of opinion it \n is your interest to adopt it i am convinced that this is the safest course \n for your liberty your dignity and your happiness i affect not reserves \n which i do not feel i will not amuse you with an appearance of deliberation \n when i have decided i frankly acknowledge to you my convictions and \n i will freely lay before you the reasons on which they are founded the \n consciousness of good intentions disdains ambiguity i shall not however \n multiply professions on this head my motives must remain in the depository \n of my own breast my arguments will be open to all and may be judged \n of by all they shall at least be offered in a spirit which will not disgrace \n the cause of truth i propose in a series of papers to discuss the following interesting \n particulars the utility of the union to your political prosperity the insufficiency \n of the present confederation to preserve that union the necessity of a \n government at least equally energetic with the one proposed to the attainment \n of this object the conformity of the proposed constitution to the true \n principles of republican government its analogy to 
your own state constitution \n and lastly the additional security which its adoption will afford to \n the preservation of that species of government to liberty and to property in the progress of this discussion i shall endeavor to give a satisfactory \n answer to all the objections which shall have made their appearance that \n may seem to have any claim to your attention it may perhaps be thought superfluous to offer arguments to prove the \n utility of the union a point no doubt deeply engraved on the hearts \n of the great body of the people in every state and one which it may \n be imagined has no adversaries but the fact is that we already hear \n it whispered in the private circles of those who oppose the new constitution \n that the thirteen states are of too great extent for any general system \n and that we must of necessity resort to separate confederacies of distinct \n portions of the whole this doctrine will in all \n probability be gradually propagated till it has votaries enough to countenance \n an open avowal of it for nothing can be more evident to those who are \n able to take an enlarged view of the subject than the alternative of \n an adoption of the new constitution or a dismemberment of the union it \n will therefore be of use to begin by examining the advantages of that \n union the certain evils and the probable dangers to which every state \n will be exposed from its dissolution this shall accordingly constitute \n the subject of my next address"
To analyze this we need to convert the text into tokens that we can analyze by breaking the text of each document apart into separate words. To do so we are going to do the following manipulation:
tokens <- corpus.tidy %>%
# tokenizes into words and stems them
unnest_tokens(word, text, token = "word_stems") %>%
# remove any numbers in the strings
mutate(word = str_replace_all(word, "\\d+", "")) %>%
# drop any empty strings
filter(word != "")
So unnest_tokens breaks the document up into words and extracts the “stem” of each – i.e., the portion of the word with lexical meaning (e.g., not suffixes or prefixes) – which we then mutate to replace all numbers in the string with an empty string, and then we conclude by filtering out the empty strings.
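To see what the mutate and filter steps are doing, here is a minimal base-R sketch on a hypothetical toy vector (the tokens below are made up, not drawn from the corpus):

```r
# Hypothetical toy tokens to illustrate the cleanup steps
toks <- c("constitut", "1787", "union", "", "87th")

# Replace any run of digits with the empty string
# (the base-R analogue of str_replace_all(word, "\\d+", ""))
toks <- gsub("\\d+", "", toks)

# Drop tokens that are now empty (the analogue of filter(word != ""))
toks <- toks[toks != ""]

toks
## [1] "constitut" "union"     "th"
```

Note that "87th" becomes "th": stripping digits can leave behind fragments, one of the small measurement choices this kind of preprocessing forces on us.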
So what did this give us?
tokens
## # A tibble: 187,105 × 3
## author document word
## <chr> <int> <chr>
## 1 hamilton 1 after
## 2 hamilton 1 an
## 3 hamilton 1 unequivoc
## 4 hamilton 1 experi
## 5 hamilton 1 of
## 6 hamilton 1 the
## 7 hamilton 1 ineffici
## 8 hamilton 1 of
## 9 hamilton 1 the
## 10 hamilton 1 subsist
## # … with 187,095 more rows
So each observation in this tidy tibble is a single word in a single document. There are therefore 187,105 word tokens used across the 85 Federalist papers.
But what are the Federalist Papers about? Can we deduce the meaning based on the words being used? To start, what if we look at the distributions of words to characterize meaning. To do so let’s create a table of the frequency of each word sorted in decreasing order (i.e., arrange(-n)):
tokens %>%
count(word) %>%
arrange(-n)
## # A tibble: 4,980 × 2
## word n
## <chr> <int>
## 1 the 17388
## 2 of 11507
## 3 to 6905
## 4 and 5032
## 5 in 4390
## 6 a 3954
## 7 be 3927
## 8 it 3163
## 9 that 2750
## 10 is 2160
## # … with 4,970 more rows
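The count() and arrange(-n) combination has a compact base-R analogue using table() and sort(); here is a sketch on a small made-up word vector (not the actual corpus):

```r
# Hypothetical word vector standing in for the tokens
words <- c("the", "of", "the", "union", "the", "of")

# Count each word and sort in decreasing frequency,
# mirroring count(word) %>% arrange(-n)
counts <- sort(table(words), decreasing = TRUE)
counts
```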
Hmm. So the Federalist Papers are about “the” and “of” and “to”? It makes sense that these transition words are most frequently used, but let’s strip out these words that are not useful for understanding content. To do so we are going to read in the dataframe stop_words that is contained in the package tidytext. This is a predefined dictionary of commonly used words in the English language that provides a standard set of words that can be used to prune the text and remove them from the analysis.
Note that the anti_join is essentially going through the tibble tokens to remove (i.e., “anti-join”) any observations contained in the tibble stop_words. (The by argument defines how we are going to try to match the tibble tokens to stop_words – here we are going to look word-by-word for matches to remove.)
data("stop_words", package = "tidytext")
tokens <- anti_join(tokens, stop_words, by = "word")
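To make the anti_join logic concrete, here is a base-R sketch with hypothetical word vectors: for a single matching column, an anti-join amounts to a negated %in%.

```r
# Hypothetical token and stop-word vectors (illustration only)
token_words <- c("govern", "the", "power", "of", "union")
stops       <- c("the", "of", "and", "to")

# Keep only tokens with no match among the stop words,
# which is what anti_join(tokens, stop_words, by = "word") does row-wise
kept <- token_words[!(token_words %in% stops)]
kept
## [1] "govern" "power"  "union"
```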
So now let us see what is left and how that compares to the table we created before removing the “stop words” – i.e., all of the little words we use in English to combine ideas (e.g., the, and, or, all): basically, words that are used in nearly every expression of written English regardless of topic or content.
tokens %>%
count(word) %>%
arrange(-n)
## # A tibble: 4,675 × 2
## word n
## <chr> <int>
## 1 govern 1026
## 2 power 905
## 3 constitut 672
## 4 nation 566
## 5 ani 540
## 6 peopl 522
## 7 author 393
## 8 object 375
## 9 union 361
## 10 everi 348
## # … with 4,665 more rows
Much better, and much different!
Now let’s use this tibble to create a new tibble that is a count by document and word stem. In the resulting tibble dtm – document-term matrix, not “dead-to-me” – each observation is a word stem in a document, with the variable n denoting the frequency with which each word stem appears.
dtm <- tokens %>%
count(document, word)
glimpse(dtm)
## Rows: 38,829
## Columns: 3
## $ document <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ word <chr> "abl", "absurd", "accid", "accord", "acknowledg", "act", "act…
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 1, 2, 1…
So using this we can then try to visualize the most frequently used words using a wordcloud. To produce a wordcloud we need to provide a list of potential words (words) and the frequency with which each word appears (freq); we can also limit the visualization to words that are mentioned at least min.freq times and plot no more than max.words words in the resulting word cloud.
So if we did not like the table we produced earlier of the most frequently used words we could also depict this graphically. So what are the federalist papers about?
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(words = dtm$word,
freq = dtm$n,
min.freq = 20,
max.words = 200,
random.order=FALSE,
rot.per=0.35)
NOTE: How does your understanding/impression/characterization change as you change these parameters? Try it out. What does that mean in terms of the usefulness of a wordcloud?
# INSERT CODE HERE
As an aside, if you wanted to use piping, you could do the same using the following code – here using only the top 20 words. Note that the {} is required because we are piping dtm through to wordcloud and referring to it internally via the . placeholder.
dtm %>%
{ wordcloud(.$word, .$n, max.words = 20) }
So what if we wanted to focus on a particular paper – say Federalist 10? One way is to create a new tibble containing just Federalist 10. This may be useful if you were going to do a lot of subsequent work focusing on Federalist 10 and you wanted to avoid having to filter every time.
dtm10 <- dtm %>%
filter(document == 10) %>%
arrange(-n)
wordcloud(words = dtm10$word,
freq = dtm10$n,
min.freq = 3,
max.words = 50,
random.order=FALSE,
rot.per=0.35)
But we could also produce a wordcloud of the top 20 words in Federalist 10 by applying it to the appropriately filtered dtm tibble:
dtm %>%
filter(document == 10) %>% {
wordcloud(.$word, .$n, max.words = 20)
}
So what are the 10 most frequently used words? (Note that top_n() keeps ties, which is why the table below has 11 rows – three words are tied at n = 8.)
dtm10 %>%
top_n(10,
wt=n)
## # A tibble: 11 × 3
## document word n
## <int> <chr> <int>
## 1 10 parti 20
## 2 10 faction 17
## 3 10 govern 17
## 4 10 public 14
## 5 10 citizen 13
## 6 10 major 12
## 7 10 passion 10
## 8 10 form 9
## 9 10 properti 8
## 10 10 republ 8
## 11 10 union 8
How else can we summarize/describe data? Cluster Analysis via kmeans?
But using what data? Should we focus on the number of words being used? The proportion of times a word is used in a particular document? Or some other transformation that tries to account for how frequently a word is used in a particular document relative to how frequently it is used in the overall corpus?
We are going to use the text analysis function bind_tf_idf, which will take a document-term matrix and compute the fraction of times each word is used in each document (tf = “term frequency”). It also computes a transformation called tf-idf that balances how frequently a word is used in a document against how common that word is across the documents of the corpus.
For word w in document d we can compute the tf-idf using: \[
\text{tf-idf}(w,d) = \text{tf}(w,d) \times \log \left( \frac{N}{\text{df}(w)} \right)
\] where tf is the term frequency (word count/total words in the document), df(w) is the number of documents in the corpus that contain the word, and N is the number of documents in the corpus. The inverse document frequency idf for each word w is therefore the log of the number of documents in the corpus N over the number of documents containing the word.
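As a quick numerical check of the formula, here it is computed by hand in base R on a tiny hypothetical three-document corpus:

```r
# Three hypothetical "documents", each a vector of word stems
docs <- list(d1 = c("govern", "power", "power"),
             d2 = c("govern", "union"),
             d3 = c("union", "union", "court"))
N <- length(docs)                                     # 3 documents

# tf: how often "power" appears in d1, as a share of d1's words
tf <- sum(docs$d1 == "power") / length(docs$d1)       # 2/3

# df: how many documents contain "power" at all
df <- sum(sapply(docs, function(d) "power" %in% d))   # 1

# tf-idf: frequent in d1 *and* rare across the corpus, so relatively large
tfidf <- tf * log(N / df)
tfidf
## [1] 0.7324082
```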
NOTE: In what follows we are going to focus on the tf as it is easier to grasp conceptually, but if you are interested you should replicate the analyses using the tf_idf transformation to see how measurement choices can matter (i.e., replace tf with tf_idf).
So let us create a new document-term-matrix object that also includes the tf, idf, and tf_idf associated with each word. Using the resulting tibble – dtm.tfidf – let us look at the values associated with Federalist 10, written by Madison:
dtm.tfidf <- bind_tf_idf(dtm, word, document, n)
dtm.tfidf %>%
filter(document == 10) %>%
top_n(10,
wt=tf_idf)
## # A tibble: 10 × 6
## document word n tf idf tf_idf
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 10 cure 5 0.00466 2.50 0.0116
## 2 10 democraci 5 0.00466 3.06 0.0142
## 3 10 faction 17 0.0158 1.26 0.0200
## 4 10 factious 4 0.00372 2.65 0.00987
## 5 10 injustic 4 0.00372 2.50 0.00930
## 6 10 major 12 0.0112 1.15 0.0128
## 7 10 parti 20 0.0186 0.705 0.0131
## 8 10 passion 10 0.00931 0.946 0.00881
## 9 10 properti 8 0.00745 1.22 0.00912
## 10 10 republ 8 0.00745 1.18 0.00882
So kmeans takes a matrix where each column is a different variable that we are interested in using to characterize patterns, but the data we have is arranged one term per document per row. To transform the data into the format we require we need to “recast” the data so that each word is a separate variable – meaning that the number of variables is the number of unique word stems.
cast_dtm(dtm.tfidf, document, word, tf)
## <<DocumentTermMatrix (documents: 85, terms: 4675)>>
## Non-/sparse entries: 38829/358546
## Sparsity : 90%
## Maximal term length: 18
## Weighting : term frequency (tf)
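The recasting step can be mimicked in base R with xtabs(), which spreads a long (document, word, n) table into a documents-by-words matrix; here is a sketch on hypothetical counts:

```r
# Hypothetical long-format counts, like a tiny dtm
long <- data.frame(document = c(1, 1, 2, 2),
                   word     = c("govern", "power", "govern", "union"),
                   n        = c(3, 1, 2, 4))

# Cast to a documents-by-words matrix; document/word pairs that never
# occur become 0, which is why document-term matrices are "sparse"
wide <- xtabs(n ~ document + word, data = long)
wide
```

Each of the two rows is now a document and each of the three columns a word, which is the shape kmeans expects.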
So now let us create this recast object and use it for analysis.
castdtm <- cast_dtm(dtm.tfidf, document, word, tf)
set.seed(42)
km_out <- kmeans(castdtm,
centers = 4,
nstart = 25)
So how many documents are associated with each cluster?
table(km_out$cluster)
##
## 1 2 3 4
## 46 29 2 8
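As a sanity check on what kmeans is doing here, consider a self-contained toy run on hypothetical data with two obvious clusters, using the same centers/nstart arguments as above:

```r
set.seed(42)

# Hypothetical data: two well-separated clouds of 10 points each,
# with the 2 columns standing in for 2 "word" variables
toy <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))

km_toy <- kmeans(toy, centers = 2, nstart = 25)

# Each cluster should recover one cloud of 10 "documents"
table(km_toy$cluster)
```

With clouds this far apart the two clusters come out as 10 and 10; with real documents the boundaries are far murkier, which is why the cluster sizes above are so uneven.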
So let’s tidy it up to see the centroids – here, the mean frequencies – associated with each word in each cluster.
tidy(km_out)
## # A tibble: 4 × 4,677
## abl absurd accid accord acknowledg act actuat add addit
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.00158 0.000287 3.84e-5 1.05e-3 0.000480 0.00135 1.64e-4 3.20e-4 9.17e-4
## 2 0.000789 0.0000996 0 6.54e-4 0.000160 0.00243 1.82e-4 7.56e-4 1.06e-3
## 3 0 0 0 1.69e-3 0 0.00458 7.31e-4 0 0
## 4 0 0.000545 1.49e-4 1.31e-3 0.000186 0.00255 0 1.86e-4 5.77e-4
## # … with 4,668 more variables: address <dbl>, admit <dbl>, adopt <dbl>,
## # advantag <dbl>, adversari <dbl>, advoc <dbl>, affect <dbl>, afford <dbl>,
## # aggrand <dbl>, aim <dbl>, alreadi <dbl>, altern <dbl>, alway <dbl>,
## # ambigu <dbl>, ambit <dbl>, ambiti <dbl>, america <dbl>, amus <dbl>,
## # analog <dbl>, angri <dbl>, ani <dbl>, animos <dbl>, anoth <dbl>,
## # answer <dbl>, antagonist <dbl>, apt <dbl>, ardent <dbl>, argument <dbl>,
## # arriv <dbl>, artific <dbl>, astray <dbl>, attain <dbl>, attempt <dbl>, …
Very hard to summarize given that there are 4677 columns! (Good luck trying to graph that!)
How can we summarize? Can we augment? No, because augment does not work for DocumentTermMatrix objects. So let’s do it by hand.
First we are going to create a new tibble called words_kmean containing all of the unique words in the document-term-matrix object we analyzed. Then we are going to bind_cols this to the transpose of the matrix of centers (i.e., the mean frequency with which each term appears in documents associated with each cluster). So we can see that we have effectively flipped the rows and columns.
words_kmean <- tibble(word = colnames(castdtm))
words_kmean <- bind_cols(words_kmean, as_tibble(t(km_out$centers))) # Binding the transpose of the matrix of centers
words_kmean
## # A tibble: 4,675 × 5
## word `1` `2` `3` `4`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 abl 0.00158 0.000789 0 0
## 2 absurd 0.000287 0.0000996 0 0.000545
## 3 accid 0.0000384 0 0 0.000149
## 4 accord 0.00105 0.000654 0.00169 0.00131
## 5 acknowledg 0.000480 0.000160 0 0.000186
## 6 act 0.00135 0.00243 0.00458 0.00255
## 7 actuat 0.000164 0.000182 0.000731 0
## 8 add 0.000320 0.000756 0 0.000186
## 9 addit 0.000917 0.00106 0 0.000577
## 10 address 0.000369 0.000152 0.00143 0
## # … with 4,665 more rows
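The transpose step in miniature, with a hypothetical 2-cluster-by-3-word matrix of centers:

```r
# Hypothetical centers: rows are clusters, columns are word stems
centers <- matrix(c(0.01, 0.02, 0.03,
                    0.04, 0.05, 0.06),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("1", "2"),
                                  c("govern", "power", "union")))

# t() flips rows and columns: words become rows, clusters become
# columns, which is the word-per-row shape we want for the tibble
flipped <- t(centers)
flipped["govern", "1"]
## [1] 0.01
```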
So now let us see what the top words are by gathering the words in each cluster according to their value (i.e., average document frequency) and then reporting the top 10 words in each cluster organized in descending overall average document frequency. (TEST: Can you make it report in descending average document frequency by cluster?)
top_words_cluster <-
gather(words_kmean, cluster, value, -word) %>%
group_by(cluster) %>%
top_n(10,
wt=value) %>%
arrange(-value)
top_words_cluster
## # A tibble: 40 × 3
## # Groups: cluster [4]
## word cluster value
## <chr> <chr> <dbl>
## 1 execut 3 0.0438
## 2 depart 3 0.0423
## 3 legisl 3 0.0366
## 4 power 3 0.0330
## 5 constitut 3 0.0268
## 6 power 4 0.0249
## 7 govern 1 0.0194
## 8 judiciari 3 0.0180
## 9 court 4 0.0175
## 10 law 4 0.0165
## # … with 30 more rows
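The gather-then-top-n-per-group pattern also has a base-R analogue via split() and order(); here is a sketch on hypothetical cluster values:

```r
# Hypothetical long-format (cluster, word, value) rows
long <- data.frame(cluster = c("1", "1", "1", "2", "2", "2"),
                   word    = c("govern", "power", "union",
                               "court", "law", "union"),
                   value   = c(0.019, 0.011, 0.007,
                               0.017, 0.016, 0.010))

# For each cluster, sort by value (descending) and keep the top 2,
# mirroring group_by(cluster) %>% top_n(2, wt = value)
top2 <- do.call(rbind, lapply(split(long, long$cluster), function(d) {
  d[order(-d$value), ][1:2, ]
}))
top2$word
## [1] "govern" "power"  "court"  "law"
```

(Unlike top_n(), this simple version always takes exactly 2 rows rather than keeping ties.)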
So let’s take a look at each cluster and what the top 10 words might imply about the content of each cluster.
top_words_cluster %>%
filter(cluster==1) %>%
arrange(-value)
## # A tibble: 10 × 3
## # Groups: cluster [1]
## word cluster value
## <chr> <chr> <dbl>
## 1 govern 1 0.0194
## 2 power 1 0.0115
## 3 nation 1 0.0111
## 4 peopl 1 0.00875
## 5 union 1 0.00696
## 6 constitut 1 0.00681
## 7 ani 1 0.00661
## 8 object 1 0.00556
## 9 author 1 0.00544
## 10 feder 1 0.00482
Seems to be about power, governance, federalism, and the nature of the union proposed by the constitution. Big-picture themes and concepts about the role of government and the power of the nation versus the states.
top_words_cluster %>%
filter(cluster==2) %>%
arrange(-value)
## # A tibble: 10 × 3
## # Groups: cluster [1]
## word cluster value
## <chr> <chr> <dbl>
## 1 power 2 0.0111
## 2 govern 2 0.0107
## 3 constitut 2 0.0105
## 4 ani 2 0.0103
## 5 repres 2 0.00940
## 6 peopl 2 0.00874
## 7 senat 2 0.00829
## 8 elect 2 0.00718
## 9 execut 2 0.00678
## 10 bodi 2 0.00676
Looks like papers about representation and governance: the connections between people and their elected Representatives and (at the time) appointed Senators. Basically, the way that the democratic parts of the system would work.
top_words_cluster %>%
filter(cluster==3) %>%
arrange(-value)
## # A tibble: 10 × 3
## # Groups: cluster [1]
## word cluster value
## <chr> <chr> <dbl>
## 1 execut 3 0.0438
## 2 depart 3 0.0423
## 3 legisl 3 0.0366
## 4 power 3 0.0330
## 5 constitut 3 0.0268
## 6 judiciari 3 0.0180
## 7 govern 3 0.0150
## 8 appoint 3 0.0105
## 9 exercis 3 0.00939
## 10 branch 3 0.00888
This looks like it is about separation of powers issues. How the executive, legislative and judicial branches were related. Executive appointments and senate confirmation. Within-government relations.
top_words_cluster %>%
filter(cluster==4) %>%
arrange(-value)
## # A tibble: 10 × 3
## # Groups: cluster [1]
## word cluster value
## <chr> <chr> <dbl>
## 1 power 4 0.0249
## 2 court 4 0.0175
## 3 law 4 0.0165
## 4 constitut 4 0.0159
## 5 author 4 0.0144
## 6 nation 4 0.0102
## 7 union 4 0.0100
## 8 jurisdict 4 0.00971
## 9 govern 4 0.00959
## 10 ani 4 0.00950
The last set look to be about courts, laws, jurisdictions and power. It looks like they are related to the power and role of the judiciary in the new proposed government.
Can we make this list a bit easier to follow? Of course. Let’s summarise and then use kable (invoked via knitr) to make the resulting list of words more attractive to look at.
gather(words_kmean, cluster, value, -word) %>%
group_by(cluster) %>%
top_n(10, value) %>%
summarise(top_words = str_c(word, collapse = ", ")) %>%
knitr::kable()
| cluster | top_words |
|---|---|
| 1 | ani, constitut, feder, govern, nation, object, peopl, power, union, author |
| 2 | ani, bodi, constitut, elect, govern, peopl, power, repres, execut, senat |
| 3 | constitut, govern, power, appoint, depart, execut, branch, legisl, exercis, judiciari |
| 4 | ani, constitut, court, govern, nation, power, union, jurisdict, law, author |
Really cool. So we have used nothing more than the kmeans clustering to uncover the meaning of the Federalist Papers! (Or at least the meaning in terms of the explicit language being used.)
But recall that there were 11 papers where the authorship was contested because both Hamilton and Madison claimed authorship. Can we use the clusters to identify authorship? Put differently, do the topics we uncover covary with the authors of the papers in ways that would let us use the former to infer the latter? Did the authors separate their writing based on topic content?
Well, let’s see how well the clusters correspond to the various authors?
table(Cluster = km_out$cluster, Author = corpus.tidy$author)
## Author
## Cluster contested hamilton hamilton.madison jay madison
## 1 3 26 3 4 10
## 2 8 19 0 1 1
## 3 0 0 0 0 2
## 4 0 6 0 0 2
Not very well. A majority of Hamilton’s and Madison’s known writings were classified in Cluster 1, so it is hard to know how to interpret the 3 contested papers that were similarly classified. Nor does a similar cluster imply authorship – e.g., 4 of the papers written by Jay are in Cluster 1, but it would be wrong to attribute those papers to Hamilton or Madison!
Perhaps the 8 contested papers in Cluster 2 are more suggestive, given that the other papers in that cluster were primarily written by Hamilton (19) rather than Jay (1) or Madison (1). However, kmeans is intended to summarize within-sample, not to predict out-of-sample. So while we might be tempted to make a prediction based on similar content, realize that this is an interpretation that goes beyond what we have used the data to do. We will return to this later.
But first, we can use the same tools and processes to look at what specific authors are writing about. For example, if we wanted to look at Hamilton’s topics and Madison’s topics we can separately analyze the writings of each. Creating an index to identify the authorship of each document, we can then filter the dtm.tfidf tibble defined above to extract only the papers known to be written by Hamilton (or Madison). We can then replicate what we just did on this subset of Federalist Papers.
hamilton <- c(1, 6:9, 11:13, 15:17, 21:36, 59:61, 65:85)
madison <- c(10, 14, 37:48, 58)
dtm_hamilton <- filter(dtm.tfidf, document %in% hamilton)
castHamilton <- cast_dtm(dtm_hamilton, document, word, tf)
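The %in% operator driving that filter() call is plain base R; a small check using the same hamilton index vector:

```r
# Paper numbers attributed to Hamilton (the vector defined above)
hamilton <- c(1, 6:9, 11:13, 15:17, 21:36, 59:61, 65:85)

# %in% returns a logical vector marking which documents match,
# which is exactly what filter(document %in% hamilton) uses
docs <- 1:85
hamilton_docs <- docs[docs %in% hamilton]
length(hamilton_docs)
## [1] 51
```

So 51 of the 85 papers are attributed to Hamilton alone; Federalist 10, a Madison paper, is correctly excluded.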
Can you use this to summarize the 4 main topics that Hamilton wrote about? How many papers are in each cluster? What are the primary themes of each? And how does Hamilton’s emphasis compare to Madison’s in terms of focus and content?